-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] Separately check whether pyarrow
and cffi
are installed
#6785
base: master
Are you sure you want to change the base?
Conversation
pyarrow
and cffi
are installedpyarrow
and cffi
are installed
pyarrow
and cffi
are installedpyarrow
and cffi
are installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only tests I can think of would require a runner with pyarrow but not cffi installed.
Could you try adding tests that mock cffi
not being available by mocking sys.modules
? I tried that in an environment with scikit-learn
(another optional dependency of lightgbm
) installed and it seemed to work ok:
import sys
from unittest import mock
with mock.patch.dict(sys.modules, {'sklearn': None}):
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# False
import lightgbm as lgb
print(lgb.compat.SKLEARN_INSTALLED)
# True
We don't have any examples of that in lightgbm
's test suite, but I think it'd be interesting to try.
It appears this does not work. I don't really understand why. |
@jameslamb How would you like to continue here? |
I feel it'd be easy to accidentally undo this work in future refactorings. I will try to find a way to add a test covering this. |
python-package/lightgbm/basic.py
Outdated
if not PYARROW_INSTALLED: | ||
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` installed.") | ||
if not (PYARROW_INSTALLED and CFFI_INSTALLED): | ||
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise LightGBMError("Cannot init dataframe from Arrow without `pyarrow` and `cffi` installed.") | |
raise LightGBMError("Cannot init Dataset from Arrow without `pyarrow` and `cffi` installed.") |
This really should be Dataset
, not dataframe
... I'll make that change when I push testing changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in e72d5e2.
In that commit, I also removed backticks from these log messages, in favor of single quotes. Special characters in log messages can occasionally be problematic.
I know these things were already there before this PR, but might as well fix them right here while we're touching these lines.
def missing_module_cffi(monkeypatch): | ||
"""Mock 'cffi' not being importable""" | ||
monkeypatch.setattr(lightgbm.compat, "CFFI_INSTALLED", False) | ||
monkeypatch.setattr(lightgbm.basic, "CFFI_INSTALLED", False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Came up with this based on https://docs.pytest.org/en/stable/reference/reference.html
I'm hoping that this could establish a pattern we re-use in other tests in future PRs.
It's not perfect (for example, if setting CFFI_INSTALLED
is done incorrectly in compat.py
, then this approach wouldn't catch that), but it's a lightweight and simple way to ensure we always cover code like the changes introduced in this PR.
Referenced https://docs.pytest.org/en/stable/reference/reference.html while working on this.
@jmoralez @borchero @StrikerRUS what do you think about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach looks quite fragile. I think more complicated but right approach would be like one from the following ones: https://stackoverflow.com/a/51048604.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a lot more complicated :/
I'll try it though, I do agree that it'd be a stronger test to go all the way into getting the import to literally raise an ImportError
.
generate_dummy_arrow_table(), | ||
label=pa.array([0, 1, 0, 0, 1]), | ||
params=dummy_dataset_params(), | ||
).construct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.construct()
is necessary here... __init_from_pyarrow_table()
is not run as part of lgb.Dataset()
.
I pushed changes here, my review shouldn't count towards a merge.
Ok, think I found a pattern that'll work for this testing! I just pushed e72d5e2, proposing that and adding some other small fixes. Let me know if it looks ok to you @mlondschien . I've also dismissed my review... now that I've made such significant edits here, my review shouldn't count towards a merge. @StrikerRUS @borchero @jmoralez could one of you help with a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor suggestions below.
CFFI_INSTALLED = True | ||
except ImportError: | ||
CFFI_INSTALLED = False | ||
|
||
class arrow_cffi: # type: ignore | ||
"""Dummy class for pyarrow.cffi.ffi.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need
CData = None
addressof = None
cast = None
new = None
class members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CData
is needed for type hinting:
LightGBM/python-package/lightgbm/basic.py
Line 416 in 9f1af05
chunks: arrow_cffi.CData |
But I think the others could be safely removed. It's only showing up in the diff in this PR because this code is being moved around... so this was missed in earlier PRs (I guess #6034).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed all but CData
in 7396613
with pytest.raises( | ||
lgb.basic.LightGBMError, match="Cannot predict from Arrow without 'pyarrow' and 'cffi' installed." | ||
): | ||
bst = lgb.train( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lgb.train()
part should be outside of the with
block because we expect only predict()
should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely right, thank you!
Fixed in 7396613
Co-authored-by: Nikita Titov <[email protected]>
The only tests I can think of would require a runner with
pyarrow
but notcffi
installed.Note that the
LightGBMErrors
will only raise whenpyarrow
is installed, butcffi
is not. Ifpyarrow
is not installed,pa_Table
is a dummy class andisinstance(data, pa_Table)
returnsFalse
.This is a breaking change for users who didn't install
lightgbm[arrow]
, but rather just installedlightgbm
andpyarrow
separately. Even if not intended, they could previously train a model on apyarrow.Table
, as this was converted via to ascipy.sparse.csr_matrix(data)
. The fix is simply to installcffi
or to transform manually withscipy.sparse.csr_matrix
.Still, it is good to inform people that they are not "natively" training from a
pyarrow.Table
, incurring an unnecessary copy.As already suggested in #6782, an alternative would be to raise a warning.